Abstract

With the widespread availability of cellphones and cameras that have GPS
capabilities, it is common for images being uploaded to the Internet today
to have GPS coordinates associated with them. In addition to research that
tries to predict GPS coordinates from visual features, this also opens up
the door to problems that are conditioned on the availability of GPS
coordinates. In this work, we tackle the problem of performing image
classification with location context, in which we are given the GPS
coordinates for images in both the train and test phases. We explore
different ways of encoding and extracting features from the GPS
coordinates, and show how to naturally incorporate these features into a
Convolutional Neural Network (CNN), the current state-of-the-art for most
image classification and recognition problems. We also show how it is
possible to simultaneously learn the optimal pooling radii for a subset of
our features within the CNN framework. To evaluate our model and to help
promote research in this area, we identify a set of location-sensitive
concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M
dataset that has GPS coordinates with these concepts, which we make
publicly available. By leveraging location context, we are able to achieve
almost a 7% gain in mean average precision.

1. Introduction 

As Figure 1 shows, it is sometimes hard even for humans to recognize the
content of photos without context. Just by looking at the photos we can
conclude that all these examples can reasonably be of snow. Consider,
however, that (a) was taken at the Bonneville Salt Flats in Utah, (c) and
(d) were taken in Death Valley and Palo Alto, respectively, both of which
are areas in California that never see snow, and (b) was taken in New
Hampshire, where snowstorms are common. With this information in hand, it
is much easier to correctly deduce that (b) is the only image that actually
contains snow.

Figure 1. Which of these are images of snow? Just by looking at the images,
it may be difficult to tell. However, what if we knew that (a) was taken at
the Bonneville Salt Flats in Utah, (b) was taken in New Hampshire, (c) was
taken in Death Valley, California and (d) was taken near Palo Alto,
California? Image credits given in supplementary material.

Motivated by this observation, we tackle the problem of image
classification with location context. In particular, we are interested in
classifying consumer images with concepts that commonly occur on the
Internet, ranging from objects to scenes to specific landmarks, as these
are the things that people often take pictures of, and the Internet is the
largest source of geotagged images. Building on the CNN architecture
introduced in [25], the basis for most state-of-the-art image
classification and recognition results, we address how to represent and
incorporate location features into the network architecture. This is not an
easy problem, as we have found that naive approaches, such as concatenating
the GPS coordinates into the classifier or leveraging nearby images as a
Bayesian prior, result in almost no gain in performance. However, knowing
the GPS coordinates allows us to utilize geographic datasets and surveys
that have been collected by various institutions and agencies. We can also
leverage the large amount of data on the Internet tagged with GPS
coordinates in a data-driven fashion.

In summary, the contributions in this paper can be organized into three
parts.

Constructing effective location features from GPS coordinates. We propose
5 different types of features that build on the latitude and longitude
coordinates that are given to us, and perform a comprehensive evaluation of
the effectiveness of each feature.

Network architectures for incorporating location features. We show how to
incorporate these additional features into a CNN [25]. This allows us to
learn the visual features along with the interactions between the different
feature types in a joint framework. In addition, we also show how we can
simultaneously learn the parameters required for constructing a subset of
our features in the same framework, giving us improved performance and a
better understanding of what the network is learning.

YFCC100M-GEO100 dataset. We introduce annotations for a set of
location-sensitive concepts on a subset of the Yahoo Flickr Creative
Commons 100M (YFCC100M) dataset [4], which we denote as the YFCC100M-GEO100
dataset, and make our annotations publicly available. This dataset consists
of 88,986 images over 100 classes, and allows us to evaluate our models at
scale.

2. Related Work 

There is a large body of work that focuses on the problem of image
geolocation, such as geolocating static cameras [22], city-scale location
recognition [36], im2gps [20, 42], place recognition [19, 39], landmark
recognition [10, 13, 33, 43], geolocation leveraging geometry information
[10, 21, 30, 35], and geolocation with graph-based representations [12].
More recent works have also tried to supplement images with corresponding
data [7, 8, 26, 29, 31], such as digital elevation maps and land cover
survey data, which we draw inspiration from in constructing our features.
In contrast to these works, we assume we are given GPS coordinates, and use
this information to help improve image classification performance.

In addition, several works also explore other aspects of images and
location information, such as 3D cars with locations [32], organizing
geotagged photos [14], structure from motion on Internet photos [37],
recognizing city identity [44], looking beyond the visible scene [24],
discovering representative geographic visual elements [16, 27], predicting
land cover from images [28], and annotation enhancement using canonical
correlation analysis [11].

Most similar are works that leverage location information for recognition
tasks [5, 6, 9, 41]. The work of [5] tackles object recognition with
geo-services on mobile devices in small urban environments. The work of [6]
uses available Geographic Information System (GIS) databases by projecting
exact location information of traffic signs, traffic signals, trash cans,
fire hydrants, and street lights onto images as a prior. The work of [9]
uses bird sightings to estimate a spatio-temporal prior distribution to
help improve fine-grained categorization performance. The work of [41]
leverages season and location context in a probabilistic framework to help
improve region recognition in images. Our work differs in that we are
interested in recognizing a wide range of concepts present on the Internet
beyond birds [9], small sets of specific urban objects [5, 6] or generic
region types [41], and constructing features that are not specific to a
particular class or source of GIS information. In addition, we exhaustively
evaluate ways of incorporating these features into a CNN, and we propose a
way to parameterize the geo-features and extend the backpropagation
algorithm to allow the net to learn the most discriminative geo-feature
parameters. We also introduce a large-scale geotagged dataset collected
from real-world images to train our models and effectively evaluate
performance.

Also closely related are the numerous works on context, which has been
shown to be helpful for various tasks in computer vision [15, 40]. We
leverage contextual information by considering the GPS coordinates of our
images and extracting complementary location features.

3. Our Approach 

Similar to standard image classification problems, we are given a set of
$n$ training images $\{I_1, I_2, \ldots, I_n\}$ with associated class
labels $\{c_1, c_2, \ldots, c_n\}$, where $c_i \in \mathcal{C}$ and
$\mathcal{C}$ is the set of classes we are trying to predict. In addition
to the images, we are also given the GPS coordinates for each image,
$\{(long_1, lat_1), (long_2, lat_2), \ldots, (long_n, lat_n)\}$, where
$long_i$ is the longitude and $lat_i$ is the latitude for image $i$. Note
that the GPS coordinates are given in both the training and testing phase,
and our goal is to predict the class labels given both the image and the
GPS coordinates. In this paper, we focus on images taken within the
contiguous United States, but the majority of our features can be trivially
extended to encompass the entire world.

3.1. Neural network architecture 

Figure 2. Our CNN architecture. The pink rectangles denote convolutional
layers, the yellow rectangles denote normalization layers, the blue
rectangles denote pooling layers, the grey rectangles denote fully
connected layers, and the green rectangles denote concatenation layers. The
final fully connected layer is the softmax layer. Our model is given as
input an image and its associated longitude and latitude coordinates. The
image network denoted by the magenta box is the network architecture
introduced in [25].

We build on the CNN model introduced in [25], as this model and extensions
to it are commonly used benchmarks in image classification and recognition
[18, 34, 38]. For more details on the network architecture, we refer the
reader to [25]. To incorporate location features into the network, we add a
layer to concatenate the different feature types before the softmax layer,
as shown in Figure 2. This makes intuitive sense, as the lower layers of
the CNN model are aimed at learning effective image filters and features,
and we are interested in incorporating our features later on at a higher
semantic level. In addition, we also experiment with adding depth using
fully connected layers before and after the concatenation layer, denoted by
the pre-cat and post-cat layers in Figure 2, and perform comprehensive
experiments detailed later in the paper.
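
As a rough illustration of the concatenation scheme described above, the following Python sketch passes each location feature through its own pre-cat layer, concatenates the outputs with the image feature, and applies a softmax classifier. The ReLU nonlinearity, the helper names (`dense`, `relu`, `softmax`, `forward`), and the toy dimensions are assumptions made for illustration and are not taken from the paper; the post-cat layer is omitted for brevity.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer; weights is a list of rows (outputs x inputs)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    # Assumed nonlinearity for the pre-cat layers.
    return [max(0.0, v) for v in x]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(image_feat, location_feats, precat_params, softmax_params):
    """Concatenate the image feature with pre-cat outputs, then classify."""
    parts = [image_feat]
    for feat, (weights, bias) in zip(location_feats, precat_params):
        parts.append(relu(dense(feat, weights, bias)))
    concatenated = [v for part in parts for v in part]
    weights, bias = softmax_params
    return softmax(dense(concatenated, weights, bias))
```

In this sketch each feature type has its own pre-cat parameters, so the dimensionality reduction is learned per feature before the features interact in the final classifier.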

With this architecture, we turn to the problem of extracting location
features. For each image $i$, we construct a set of features that
effectively represent contextual information about the location specified
by the GPS coordinates $(long_i, lat_i)$. To do this, we utilize the wealth
of geographic datasets and surveys collected by various agencies that
document a large variety of statistics about each location, ranging from
surveys on age and education, to geographic features such as elevation and
precipitation. We also utilize the large amounts of geotagged data
available on the Internet such as images and textual posts.

3.2. GPS encoding feature 

The actual GPS coordinates are very fine location indicators, making them
difficult for the classifier to use effectively. To make better use of the
coordinates, we divide the contiguous United States into a rectangular grid
with a latitude to longitude ratio of 1:2, and construct an indicator
vector for each image $i$ that indicates which grid cell the GPS coordinate
$(long_i, lat_i)$ falls into, resulting in a feature vector with dimension
equal to the total number of cells in the grid. The aspect ratio is chosen
so that each grid cell is roughly a square. We used rectangular grids up to
100x200, resulting in 25x25km square cells; even larger grids were ruled
out by computational time and memory.
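
The grid encoding above can be sketched in a few lines of Python. The bounding box values for the contiguous United States and the function names (`gps_grid_index`, `gps_encoding`) are assumptions chosen for illustration; the paper does not state the exact extent it used.

```python
# Approximate bounding box of the contiguous United States (assumed values).
LAT_MIN, LAT_MAX = 24.0, 50.0
LONG_MIN, LONG_MAX = -125.0, -66.0


def gps_grid_index(lon, lat, rows=100, cols=200):
    """Return the flat index of the grid cell that contains (lon, lat)."""
    # Fractional position inside the bounding box, in [0, 1].
    fx = (lon - LONG_MIN) / (LONG_MAX - LONG_MIN)
    fy = (lat - LAT_MIN) / (LAT_MAX - LAT_MIN)
    if not (0.0 <= fx <= 1.0 and 0.0 <= fy <= 1.0):
        raise ValueError("coordinate lies outside the bounding box")
    # Clamp so that points on the upper edge fall into the last cell.
    col = min(int(fx * cols), cols - 1)
    row = min(int(fy * rows), rows - 1)
    return row * cols + col


def gps_encoding(lon, lat, rows=100, cols=200):
    """Indicator (one-hot) vector of length rows * cols."""
    vec = [0.0] * (rows * cols)
    vec[gps_grid_index(lon, lat, rows, cols)] = 1.0
    return vec
```

With the default 100x200 grid the vector has 20,000 entries, exactly one of which is non-zero, so a sparse representation would be a natural choice in practice.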

3.3. Geographic map feature 

Figure 3. Example geographic map of precipitation in the United States [2],
with darker colors roughly indicating larger values of average
precipitation. Regions with more rainfall may give rise to images that more
commonly contain objects such as umbrellas.

There exist many different types of geographic maps and datasets that
provide detailed information about each GPS coordinate in the form of a
colored map, with different colors representing different geographic
features. In particular, Google Maps [2] is one of many online sites that
stores a large set of such maps, with an example shown in Figure 3. We use
10 different types of maps from Google Maps: average vegetation,
congressional district, ecoregions, elevation, hazardous waste, land cover,
precipitation, solar resource, total energy, and wind resource. Since each
map uses different colors to represent the value of a feature at a
particular location, for each image $i$ we take the normalized pixel color
values in a 17x17 patch around the GPS coordinate $(long_i, lat_i)$ for
each map type, and concatenate these to form an 8670-dimensional feature.
Intuitively, map features such as precipitation may tell us how likely it
is to see an umbrella in a picture, while indicators such as elevation may
tell us how likely it is for us to see snow.
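
A minimal sketch of the patch extraction follows. It assumes each map is an RGB image stored as rows of 8-bit `(r, g, b)` tuples and that the GPS coordinate has already been converted to a pixel position `(row, col)`; three color channels would account for the stated dimensionality, since 17 x 17 x 3 x 10 = 8670. The border clamping and the function names are illustrative assumptions.

```python
def patch_feature(image, row, col, size=17):
    """Flatten a size x size RGB patch centred on (row, col), scaled to [0, 1]."""
    half = size // 2
    height, width = len(image), len(image[0])
    feature = []
    for r in range(row - half, row + half + 1):
        rr = min(max(r, 0), height - 1)  # clamp at the border (assumption)
        for c in range(col - half, col + half + 1):
            cc = min(max(c, 0), width - 1)
            feature.extend(channel / 255.0 for channel in image[rr][cc])
    return feature


def geographic_map_feature(maps, row, col, size=17):
    """Concatenate the patch features of every map type."""
    feature = []
    for image in maps:
        feature.extend(patch_feature(image, row, col, size))
    return feature
```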

Figure 4. To build the hashtag context features, we look at the
distribution of hashtags around each GPS coordinate by finding Instagram
images tagged with relevant hashtags and GPS coordinates. For each hashtag,
we pool over circles of different radii, counting the number of times each
hashtag (blue/magenta stars) appears within a particular radius.

3.4. ACS feature 

Given the GPS coordinate $(long_i, lat_i)$ for image $i$, we can perform
reverse geocoding to obtain the corresponding zip code. This allows us to
tap into the rich source of geographic surveys organized by zip code,
collected by agencies like the United States government. We use the
American Community Survey (ACS) [1], an ongoing survey that provides yearly
data with statistics on age, sex, race, family/relationships, income and
benefits, health insurance, education, veteran status, disabilities, work
status, and living conditions, all organized by zip code and pooled over a
5 year period. We treat each statistic as a feature, and collect them into
a vector, resulting in a 21,038-dimensional feature. Intuitively,
statistics such as age may tell us how likely it is to see toys in a
picture, while statistics such as income may tell us how likely it is for
us to see expensive cars.

3.5. Hashtag context feature 

The aforementioned geographic map and ACS features are based on map and
survey data collected about a particular location from various agencies.
However, a large source of data lies directly on the Internet, where
millions of images are uploaded daily, many of which are tagged with GPS
coordinates. We propose a set of data-driven features that are able to make
use of the images on Instagram [3].

Intuitively, for each image $i$ associated with GPS coordinate
$(long_i, lat_i)$, our goal is to capture the distribution of hashtags in
the vicinity. Hashtags that commonly occur near image $i$ can help indicate
the types of things that occur in the real-world context of image $i$,
giving us contextual information about what is in the image. We start by
defining a set of hashtags $\mathcal{H}$ that we are interested in. For a
particular hashtag $h \in \mathcal{H}$, we obtain images from Instagram
with GPS coordinates and matching hashtag. Then, we define a set of radii
$\mathcal{R}$, and for each $r \in \mathcal{R}$, we pool over a circle of
radius $r$ around $(long_i, lat_i)$ and count the number of images tagged
with hashtag $h$ that fall into the radius. As shown in Figure 4, this is
done for each of the radii in $\mathcal{R}$ and each of the hashtags in
$\mathcal{H}$, resulting in a set of $|\mathcal{H}| \times |\mathcal{R}|$
counts.

To build features from these counts, we perform two types of normalization
for the $|\mathcal{H}|$ counts in each radius $r \in \mathcal{R}$. The
first is normalization across hashtags, where we normalize each count by
the sum of counts for all hashtags within $r$. This normalization gives us
an idea of the relative frequency of a particular hashtag in relation to
the other hashtags that appear in the area, and normalizes for the density
of photos in the area. The second is normalization within hashtag, where we
normalize each count by the total number of images we obtained from
Instagram with the particular hashtag. This normalization gives us an idea
of the relative frequency of a particular concept in relation to how often
this concept appears in the entire United States. We perform both types of
normalization and concatenate the feature vectors together to form the
final feature vector, resulting in a
$2 \times |\mathcal{H}| \times |\mathcal{R}|$ dimensional feature.

In our experiments, we set $\mathcal{H} = \mathcal{C}$, using the set of
classes as the set of hashtags for simplicity, and set
$\mathcal{R} = \{1000, 2000, \ldots, 10000\}$. To save computation time, we
quantize all the GPS coordinates into a 25000x50000 grid, which results in
approximately square grid cells each covering a 100x100 meter area.
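
The pooling and the two normalizations described in this section can be sketched as follows. This is an illustrative sketch only: it measures great-circle distances directly instead of using the quantized grid, it assumes the radii are given in meters, and the function names and the `tagged_points` structure (a dict mapping each hashtag to a list of `(lon, lat)` pairs) are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def hashtag_context_feature(lon, lat, tagged_points, hashtags, radii):
    """Return the 2 x |H| x |R| feature: across-hashtag then within-hashtag parts."""
    counts = {h: [0] * len(radii) for h in hashtags}
    for h in hashtags:
        for plon, plat in tagged_points.get(h, []):
            d = haversine_m(lon, lat, plon, plat)
            for j, r in enumerate(radii):
                if d <= r:
                    counts[h][j] += 1
    across, within = [], []
    for j in range(len(radii)):
        total_in_radius = sum(counts[h][j] for h in hashtags)
        for h in hashtags:
            # Normalization across hashtags: share of this hashtag within radius j.
            across.append(counts[h][j] / total_in_radius if total_in_radius else 0.0)
            # Normalization within hashtag: share of all images with this hashtag.
            n_h = len(tagged_points.get(h, []))
            within.append(counts[h][j] / n_h if n_h else 0.0)
    return across + within
```

Because the circles are nested, the counts for a given hashtag are non-decreasing in the radius.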

3.6. Visual context feature 

The visual context feature is similar to the hashtag context feature,
except in this case we would like to take advantage of the visual signal
around our GPS coordinate $(long_i, lat_i)$, and not just the hashtags that
have been tagged. To do this, we retrieve images from various online social
websites with GPS coordinates, and for each image run a CNN with similar
architecture to [25] to generate probabilities for 594 of the common types
of concepts that appear on the Internet, such as “clothes”, “girl”, and
“coffee”. The full list of 594 concepts is given in the supplementary
material.

Similar to the hashtag context feature, we pool the probabilities for each
radius $r \in \mathcal{R}$ around $(long_i, lat_i)$ by summing the
probabilities of all the images that fall into the radius, individually for
each concept, resulting in a set of $594 \times |\mathcal{R}|$
probabilities. Then, we perform the same two types of normalization and
concatenate to form the final feature vector, resulting in a
$2 \times 594 \times |\mathcal{R}|$ dimensional feature. We use the same
set of radii $\mathcal{R}$ and GPS grid quantization as the hashtag context
feature.

4. Learning the Optimal Pooling Radius 

In the previous section we introduced the hashtag context and visual
context features. For both of these features, we explained how to build
features from the aggregated hashtag counts and concept probabilities by
concatenating together normalized histograms pooled over a set of radii
$\mathcal{R}$. However, we don’t expect that all radii are informative. For
example, for hashtags or concepts that are rare, even being a few
kilometers away may be an important indicator. Similarly, certain hashtags
that are common may require being extremely close to truly pinpoint the
location.

Radius learning layer. To address this, we show how to construct a layer in
the CNN that automatically learns the optimal radius used for pooling,
which we denote as the radius learning layer. Learning the optimal radius
is potentially useful in many ways. First, by focusing on the important
radii with informative features, we can avoid overfitting. Second, we can
visualize the radii that we have learned, providing insight into what the
CNN is learning.

We start by considering the radius for a single hashtag/concept $h$, and
fit a function $H_{(long_i, lat_i), h}(\rho)$ over the histogram that
returns the value of the histogram feature for hashtag $h$ and radius
$\rho$ at the location given by $(long_i, lat_i)$. There are several ways
to fit such a function, but for simplicity we use the histogram values
computed over $\mathcal{R}$ from the previous section and fit a piece-wise
linear approximation to the values. We do this for all the hashtags,
concepts, and both types of normalization schemes to obtain a set of
$2 \cdot (|\mathcal{H}| + 594)$ histogram functions for each training image
$i$.

The outputs of these histogram functions are treated as input features to
the CNN in place of the concatenated histograms, with a radius parameter
$\rho_h$ for each hashtag/concept that selects the value of the function to
treat as input to the neural network. When computing the gradient for
backpropagation, we backpropagate the gradient of the error $E$ into the
gradient of the histogram function $H$:

$$\frac{\partial E}{\partial \rho_h} = \frac{\partial E}{\partial H(\rho_h)} \cdot \frac{\partial H(\rho_h)}{\partial \rho_h} \tag{1}$$

The first term in the RHS is the error derivative propagated to the radius
learning layer from the network architecture above it, and the second term
is the derivative at $\rho_h$ of the histogram function
$H_{(long_i, lat_i), h}$. Since we use a piece-wise linear approximation to
fit the histogram function, the second term is easily computed by taking
the slope between the two nearest points in $\mathcal{R}$. Although we
could fit more complicated functions, we found the linear approximation to
be fast and sufficient, as we aggregate gradients over all the training
examples. Since hashtags/concepts may have multiple radii and weightings
between the radii that are informative, we replicate the radius learning
layer multiple times for each histogram function.
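
The forward value and the radius gradient of a single radius learning unit can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, and clamping the radius to the sampled range is our choice rather than something the text specifies.

```python
import bisect


def histogram_value_and_slope(radii, values, rho):
    """Piecewise-linear H(rho) through (radii[k], values[k]) and its slope dH/drho."""
    # Clamp rho to the sampled range so we interpolate and never extrapolate.
    rho = min(max(rho, radii[0]), radii[-1])
    # Index of the segment [radii[k], radii[k + 1]] that contains rho.
    k = min(bisect.bisect_right(radii, rho) - 1, len(radii) - 2)
    slope = (values[k + 1] - values[k]) / (radii[k + 1] - radii[k])
    return values[k] + slope * (rho - radii[k]), slope


def radius_gradient(upstream_grad, radii, values, rho):
    """Chain rule: dE/drho = dE/dH(rho) * dH(rho)/drho."""
    _, slope = histogram_value_and_slope(radii, values, rho)
    return upstream_grad * slope
```

In training, one would presumably sum `radius_gradient` over the examples in a batch and update the radius with the same gradient descent rule as the other parameters; replicating the layer simply means keeping several independent radius values per histogram function.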


5. Dataset 

To evaluate our method, we use the recently released Yahoo Flickr Creative
Commons 100M (YFCC100M) dataset [4], which consists of 100 million Creative
Commons copyright licensed images from Flickr. Of the 100 million images,
approximately 49 million are geotagged with GPS coordinates, which makes
this dataset particularly suitable for evaluating our task because of its
unprecedented scale. Also provided with the images are tags for the images
produced by users on Flickr, which we use as a first step to identify
images that contain a particular class. However, because the tags are very
noisy, we must manually verify and discard images that do not actually
contain the classes we are interested in. As mentioned before, we focus
only on images geotagged within the contiguous United States.

Selecting location-sensitive classes. One of the problems we have to deal
with is selecting classes that are likely to be location-sensitive, and
will benefit from our location context features. This is important because
there are certainly classes that are not, and adding these additional
features may just cause the classifier to overfit. Practically speaking, we
also need a way of limiting the number of classes to a manageable number we
can annotate.

To address this issue, we use a simple data-driven method for selecting
classes. Using a large set of images from Instagram, we estimate the
discrete geospatial distribution $P$ of all images by first gridding the
contiguous United States into a fine grid, and then counting the number of
images that fall into each grid cell and normalizing to create a valid
probability distribution. Then, we obtain a large list of classes through
commonly occurring Instagram hashtags, and for each class $c$ we estimate
the geospatial distribution $Q_c$ of images tagged with $c$ in a similar
manner. With these two distributions, we compare their similarity with the
Kullback-Leibler (KL) divergence, where the sum runs over grid cells $j$:

$$D_{KL}(P \,\|\, Q_c) = \sum_{j} P(j) \log \frac{P(j)}{Q_c(j)} \tag{2}$$

Intuitively, we would like to find classes that do not exhibit a geospatial
distribution similar to the distribution of all images, as this would
suggest that they have some location-sensitive properties. The KL
divergence does this by giving us a measure of the difference between the
two probability distributions, and we select the top 100 classes with the
highest KL divergence. In practice, given a new class $c$, we can simply
compute $D_{KL}(P \,\|\, Q_c)$ and threshold to determine whether or not
the class will benefit from our additional location features. Examples of
the geospatial distributions are shown in Figure 5.
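
The class selection procedure can be sketched as follows, starting from per-cell image counts. The additive smoothing constant `eps`, which keeps the divergence finite when a class has no images in a cell, is an assumption; the text does not say how empty cells are handled. The function names are hypothetical.

```python
import math


def normalize(counts, eps=1e-9):
    """Turn grid cell counts into a probability distribution (with smoothing)."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same grid cells."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


def rank_classes(all_counts, class_counts, top_k=100):
    """Return the top_k classes whose distribution differs most from all images."""
    p = normalize(all_counts)
    scores = {c: kl_divergence(p, normalize(cnt)) for c, cnt in class_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A class whose images are spread like the overall image density scores near zero, while a class concentrated in a few cells scores high and is kept.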

Figure 5. Instagram hashtag distributions for various classes in the
contiguous United States. Although we can see interesting patterns such as
beach hashtags near coasts and the outline of the Appalachian Mountains in
the mountain hashtags, there is a great deal of noise.

Figure 6. The geographic distribution of the 88,986 images in the
YFCC100M-GEO100 dataset that we introduce.

YFCC100M-GEO100 dataset. Using the top 100 classes selected with the
highest KL divergence, we manually verified and annotated a large set of
the YFCC100M images that were noisily tagged with these classes by Flickr
users. This resulted in a dataset of 88,986 images, with at least 100
images per class, which we denote as the YFCC100M-GEO100 dataset and will
make publicly available. The classes we selected range from objects to
places to scenes, with examples such as ‘autumn’, ‘beach’, and ‘whale’ that
illustrate the diversity of classes we are trying to classify. Figure 6
visualizes the distribution of GPS coordinates for the images in the
dataset. The full list of classes is given in the supplementary material.

6. Results 

We randomly divide the YFCC100M-GEO100 dataset into an 80% training set and
20% test set. We further leave out a small portion of the training set as a
validation set for parameter tuning in our models.

Implementation details. Following [25], we train our models using
stochastic gradient descent with momentum of 0.9 and a 0.005 weight decay.
We use a learning rate of 0.1, and run approximately 30 passes through our
data, decreasing the learning rate by a factor of 0.1 every 10 passes. We
use a 0.5 dropout ratio for all of our fully connected layers. Since our
training data is relatively small, we initialize the parameters in the
Image Network portion of the model (see Figure 2) by pre-training it on a
large set of Instagram images, and then freeze the pre-trained parameters
in our model. Note that we could further fine-tune these parameters as
well, but chose not to for speed concerns.

| Method | Mean AP | Acc@1 | Acc@5 |
|---|---|---|---|
| Image only | 36.82% | 39.45% | 70.15% |
| Image + GPS coordinates | 36.83% | 39.47% | 70.23% |
| Image + GPS encoding 10x20 | 38.58% | 41.48% | 72.39% |
| Image + GPS encoding 100x200 | 38.89% | 41.67% | 72.47% |
| Image + Geographic map feature | 37.70% | 40.28% | 70.79% |
| Image + ACS feature | 40.41% | 42.79% | 73.84% |
| Image + Hashtag context feature | 39.86% | 42.27% | 73.38% |
| Image + Visual context feature | 38.81% | 41.53% | 72.31% |
| Image only (SVM) | 33.41% | 36.56% | 60.05% |
| Image + All features (SVM) | 34.61% | 38.06% | 62.88% |
| Image + All features kernel (SVM) | 35.12% | 38.57% | 63.74% |
| Image + Flickr prior 10NN | 24.15% | 25.36% | 36.46% |
| Image + Flickr prior 100NN | 33.38% | 35.45% | 60.62% |
| Image + Flickr prior 1000NN | 36.30% | 37.86% | 68.57% |
| Image + Instagram prior 1000km | 24.03% | 22.70% | 38.23% |
| Image + Instagram prior 4000km | 31.96% | 30.62% | 58.69% |
| Image + Instagram prior 8000km | 33.08% | 30.67% | 60.13% |

Table 1. Results comparing various baseline methods. For the CNN models we
do not use pre-cat and post-cat layers.

Performance metrics. To evaluate our models, we use three different
performance metrics. In addition to the standard metric of mean average
precision (AP), we also include results for normalized accuracy@1 and
normalized accuracy@5, motivated by their use in recent papers [23] as well
as the ImageNet classification challenge [34]. The normalized accuracy@k
measure indicates the fraction of test samples that contained the ground
truth label in the top k predictions, normalized per class to adjust for
differences in the number of images per class.
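
One plausible reading of the normalized accuracy@k measure is sketched below: the top-k hit rate is computed separately for each class and then averaged over classes, so that large classes do not dominate. This interpretation of "normalized per class" and the function name are assumptions, not a definition taken from the paper.

```python
def normalized_accuracy_at_k(scores, labels, k):
    """Mean over classes of the fraction of that class's samples hit in the top k.

    scores: one list of per-class scores per test sample.
    labels: the ground-truth class index of each test sample.
    """
    hits, totals = {}, {}
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + int(label in top_k)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```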

6.1. Baseline methods 

We evaluate the benefit of each proposed feature, shown in the top two
sections of Table 1, without the pre-cat and post-cat layers as a baseline
(see Figure 2). Not surprisingly, using the GPS coordinates does not yield
any significant gain in performance, as they do not make sense in the
context of a linear classifier. Using the GPS encoding features, we get
much better performance, with a gain of around 2% in all performance
measures. We can also see that for each feature, we obtain performance
gains from concatenating the features with the baseline image features,
which shows they provide complementary information. In particular, the ACS
feature yields the largest increase in performance, with almost a 4% gain
in all performance measures.

Support vector machines. We perform experiments using Support Vector
Machines (SVM) and kernelizing our features. We use kernel averaging to
combine features, as it has been shown to perform on par with more
complicated methods [17]. In the middle section of Table 1, we show results
using a multi-class hinge loss SVM classifier and cross-validating the
regularization parameter. For the naive combination, we use linear kernels
for all features, and for the kernel combination, we compute $\chi^2$
kernels for the histogram features (hashtag context, visual context), and
use linear kernels for the rest due to dimensionality concerns. In general,
we found the SVM to perform worse than the softmax classifier. Kernelizing
the histogram features with $\chi^2$ kernels performs better than using
just linear kernels, but still doesn’t exceed the performance of the
softmax.
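
Kernel averaging can be sketched as computing one Gram matrix per feature type and taking their element-wise mean. The exponential form of the $\chi^2$ kernel used here, the `gamma` parameter, and the function names are assumptions; the text does not specify which $\chi^2$ kernel variant was used.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel for non-negative histogram features."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * d)


def gram(kernel, samples):
    """Gram matrix of a kernel over a list of feature vectors."""
    return [[kernel(a, b) for b in samples] for a in samples]


def average_kernels(grams):
    """Element-wise mean of several Gram matrices over the same samples."""
    n = len(grams[0])
    return [[sum(g[i][j] for g in grams) / len(grams) for j in range(n)]
            for i in range(n)]
```

The averaged matrix can then be passed to any SVM solver that accepts a precomputed kernel.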

Bayesian priors. Following the approach used in [9], we also try
incorporating location context as a Bayesian prior. Using Bayes’ rule, the
probability of predicting class $c$ given image $I_i$ and location
$(long_i, lat_i)$ can be written as:

$$P(c \mid I_i, long_i, lat_i) = \frac{P(I_i, long_i, lat_i \mid c)\, P(c)}{P(I_i, long_i, lat_i)} \tag{3}$$

Assuming the image and location are conditionally independent given the
class, further applying Bayes’ rule and removing terms that do not depend
on $c$, we obtain:

$$P(c \mid I_i, long_i, lat_i) \propto \frac{P(c \mid I_i)}{P(c)}\, P(c \mid long_i, lat_i) \tag{4}$$

In our experiments, we assume a uniform prior over the classes for $P(c)$.
We tried two different approaches to computing the location prior
$P(c \mid long_i, lat_i)$, with results given in the bottom two sections of
Table 1. In the Flickr prior, for each test image we find the
k-nearest-neighbors (k-NN) from the training set in GPS space and use their
labels to estimate a distribution for the location prior. In the Instagram
prior, for each of our test images we take the histogram computed in the
hashtag context feature for a certain radius $r$, and use the normalized
histogram as the distribution for the location prior. Although the overall
results do not improve for either method, it’s interesting to note that for
some classes such as “disneyland”, results improve by more than 45% mean AP
for both types of priors. However, for the majority of the classes, the
location prior hurts rather than helps, causing an overall decrease in
performance.
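
The k-NN location prior and its combination with the image classifier output can be sketched as below. The squared Euclidean distance on raw degrees, the smoothing constant `eps`, and the function names are assumptions made for this illustration.

```python
def knn_location_prior(query, train_coords, train_labels, num_classes, k, eps=1e-6):
    """Estimate P(c | long, lat) from the labels of the k nearest training images."""
    def dist2(i):
        # Squared Euclidean distance in (lon, lat) degrees (assumption).
        return (train_coords[i][0] - query[0]) ** 2 + (train_coords[i][1] - query[1]) ** 2

    nearest = sorted(range(len(train_coords)), key=dist2)[:k]
    counts = [eps] * num_classes  # smoothing avoids zero probabilities
    for i in nearest:
        counts[train_labels[i]] += 1
    total = sum(counts)
    return [c / total for c in counts]


def apply_location_prior(image_probs, location_prior, class_prior=None):
    """Combine P(c | I) and P(c | long, lat) as P(c | I) / P(c) * P(c | long, lat)."""
    n = len(image_probs)
    if class_prior is None:
        class_prior = [1.0 / n] * n  # uniform P(c), as in the experiments
    unnorm = [pi / pc * pl for pi, pc, pl in zip(image_probs, class_prior, location_prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

With a uniform class prior the division by $P(c)$ only rescales the scores, so the posterior reduces to a renormalized product of the two distributions.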


6.2. Architectures for feature combination 


We evaluate the various architectures for combining the features together,
and evaluate the effect of varying levels of depth before and after the
concatenation layer in the model. Results are shown in Table 2. The top
section of the table shows results for adding additional depth in the
pre-cat layer for each individual feature, and the bottom section shows the
result with a 4096 dimensional post-cat layer. We make all comparisons to
the “Image only” model from Table 1, which we now refer to as the baseline
image model.

| Method | Mean AP | Acc@1 | Acc@5 |
|---|---|---|---|
| Image only | 36.82% | 39.45% | 70.15% |
| Image + All features with -/- | 37.97% | 40.19% | 70.67% |
| Image + All features with 128/- | 42.22% | 44.76% | 75.74% |
| Image + All features with 256/- | 42.34% | 44.82% | 75.86% |
| Image + All features with 512/- | 42.20% | 44.43% | 75.53% |
| Image + All features with 1024/- | 41.60% | 43.98% | 75.16% |
| Image + All features with 256/4096 | 43.28% | 43.74% | 74.30% |

Table 2. Results when concatenating all features and varying the pre-cat
and post-cat layers. The X/Y notation refers to the dimensionality X of the
pre-cat layers and Y of the post-cat layer, with - representing no pre-cat
or post-cat layer.

| Method | Mean AP | Acc@1 | Acc@5 |
|---|---|---|---|
| Image only | 36.82% | 39.45% | 70.15% |
| Image + Hashtag context feature | 39.86% | 42.27% | 73.38% |
| Image + Hashtag context feature RL5 | 40.19% | 42.52% | 73.57% |
| Image + Hashtag context feature RL10 | 40.80% | 43.10% | 74.15% |
| Image + Visual context feature | 38.81% | 41.53% | 72.31% |
| Image + Visual context feature RL5 | 38.75% | 41.31% | 72.08% |
| Image + Visual context feature RL10 | 39.07% | 41.78% | 72.48% |
| Image + All features with 256/- | 42.34% | 44.82% | 75.86% |
| Image + All features with 256/- RL10 | 42.91% | 45.17% | 76.09% |
| Image + All features with 256/4096 | 43.28% | 43.74% | 74.30% |
| Image + All features with 256/4096 RL10 | 43.78% | 44.14% | 74.70% |

Table 3. Results through learning the optimal pooling radius. RL5 and RL10
refer to the number of replicas (5, 10) of the radius learning layer used
to replace the concatenated histograms.

Pre-cat layer. From the results, we see that simply
concatenating the features together does not result in a significant
increase in performance (36.82% to 37.97% mean AP), likely because
the feature dimension is large, and the model is overfitting. Thus, we
introduce the pre-cat layers to capture relationships within
each feature type, and to serve as dimensionality reduction.
Although the different pre-cat sizes perform comparably, the
256-dimensional layer seems to strike the best balance between
performance and the number of parameters to learn, obtaining a gain
of roughly 5.4 to 5.7 percentage points over the baseline on all
three performance measures. We also tried adding additional depth
beyond a single layer, but found that this did not help significantly
and drastically increased the number of parameters to learn.
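
To make the trade-off between pre-cat width and parameter count concrete, the sketch below counts the weights of the fully connected layers around the concatenation. Only the X/Y layout follows the notation of Table 2. The feature dimensions, the class count, and the choice to give every input its own pre-cat layer are hypothetical placeholders chosen for illustration.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def combination_params(feature_dims, num_classes, pre_cat=None, post_cat=None):
    """Count parameters for the X/Y layout: one optional pre-cat layer per
    feature, concatenation, one optional post-cat layer, then the classifier."""
    total = 0
    cat_dim = 0
    for dim in feature_dims:
        if pre_cat is None:
            cat_dim += dim  # feature is concatenated as-is
        else:
            total += fc_params(dim, pre_cat)
            cat_dim += pre_cat  # feature is reduced before concatenation
    if post_cat is not None:
        total += fc_params(cat_dim, post_cat)
        cat_dim = post_cat
    total += fc_params(cat_dim, num_classes)
    return total


# Hypothetical dimensions for five input features and 100 classes.
dims = [4096, 2000, 1000, 500, 100]
for x in (128, 256, 512, 1024):
    count = combination_params(dims, 100, pre_cat=x)
    print(f"{x}/-: {count:,} parameters")
```

Under these assumed dimensions the parameter count grows roughly linearly with the pre-cat width, so a wider layer has to earn its cost in accuracy.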

Post-cat layer. We also perform experiments with the
post-cat layer to capture relationships between the different
feature types. We found that a 4096-dimensional fully connected
layer increases mean AP slightly (42.34% to 43.28%), but
decreases both normalized accuracy rates (Acc@1 from 44.82% to
43.74%, Acc@5 from 75.86% to 74.30%) due to overfitting.
As observed previously, adding additional depth here
also causes the model to overfit, and decreases performance.
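
Mean AP and top-k accuracy can move in opposite directions because they measure different things: AP ranks all images for one class, while Acc@k ranks all classes for one image. The sketch below shows textbook versions of both. It is an assumption that these match the evaluation code used for the tables; in particular, the normalization applied to the reported accuracies is not reproduced here.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision at each rank where a positive occurs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def top_k_accuracy(score_rows, true_classes, k):
    """Fraction of images whose true class is among the k highest-scoring classes."""
    correct = 0
    for row, truth in zip(score_rows, true_classes):
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        correct += truth in top_k
    return correct / len(true_classes)


# Toy scores for three images (rows) over three classes (columns).
score_rows = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.4, 0.35, 0.25]]
true_classes = [0, 2, 1]

per_class_ap = []
for c in range(3):
    class_scores = [row[c] for row in score_rows]
    class_labels = [int(t == c) for t in true_classes]
    per_class_ap.append(average_precision(class_scores, class_labels))
mean_ap = sum(per_class_ap) / len(per_class_ap)
```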

Figure 7. Example results comparing the baseline image model to our best model (256/- RL10), with correct predictions in green and
incorrect predictions in red. Image credits given in supplementary material.



Figure 8. AP difference between our best model (256/- RL10) and
the baseline image model for the 20 best and worst classes.

Figure 9. Visualizations of the learned radii for three classes
(Buildings, Coast, Foliage) from our best model (256/- RL10),
sorted from smallest to largest.

6.3. Learning the optimal pooling radius 

In the previous sections, we concatenated histograms
computed at varying radii for the hashtag context and visual
context features. Since there are often multiple radii
and weightings between the radii that are most informative,
we replace the concatenated histograms with multiple replicas
of radius learning layers, with results shown in Table 3.
In the top two sections, we observe larger improvements for
the hashtag context feature (39.86% to 40.80% mean AP with RL10)
and milder improvements for the visual context feature (38.81%
to 39.07%) in a controlled setting with no pre-cat
and post-cat layers. In the bottom section, we are able to obtain
an additional 0.5% gain in mean AP by using the radius
learning layers for both of our best models from the previous
section. We found again that adding the post-cat layer
causes the model to slightly overfit, and thus use the 256/-
RL10 model as our best model in the following analyses.
Figure 7 shows some interesting examples and predictions.
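
The replica idea can be illustrated with a small sketch. The radius learning layer itself is defined earlier in the paper and is not reproduced here. As a stand-in, this example assumes histograms precomputed on a grid of radii and uses linear interpolation to read out each class at its own continuous radius. The function names, the grid, and the interpolation are all assumptions for illustration, and the learning of the radii is omitted.

```python
from bisect import bisect_right


def pooled_value(grid_radii, grid_values, r):
    """Linearly interpolate one histogram bin at radius r (clamped to the grid)."""
    if r <= grid_radii[0]:
        return grid_values[0]
    if r >= grid_radii[-1]:
        return grid_values[-1]
    hi = bisect_right(grid_radii, r)
    lo = hi - 1
    t = (r - grid_radii[lo]) / (grid_radii[hi] - grid_radii[lo])
    return (1 - t) * grid_values[lo] + t * grid_values[hi]


def radius_layer(grid_radii, hists, radii_per_class):
    """One replica: read out each class c at its own radius.

    hists[i][c] is the count for class c pooled within grid_radii[i].
    """
    return [
        pooled_value(grid_radii, [h[c] for h in hists], r)
        for c, r in enumerate(radii_per_class)
    ]


def replicated_radius_layers(grid_radii, hists, replicas):
    """Concatenate the outputs of several replicas, each with its own radii."""
    output = []
    for radii_per_class in replicas:
        output.extend(radius_layer(grid_radii, hists, radii_per_class))
    return output


# Toy data: two classes, histograms precomputed at three radii.
grid = [1.0, 2.0, 4.0]
hists = [[0, 2], [4, 2], [8, 6]]
features = replicated_radius_layers(grid, hists, [[1.5, 3.0], [4.0, 0.5]])
```

With K replicas the output has K values per class, which lets a later layer weight several radii against each other instead of committing to a single one.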

Best and worst classes. In Figure 8, we show the 20
best and worst performing classes compared to the baseline
image model. Location-specific classes like “disneyland”,
“casino”, and “alcatraz” see a large increase in performance,
as they are confined to one or a small number of
locations in the United States. On the other hand, some of
the worst performing classes are car brands, which suggests
that fine-grained car classes are not very location-specific,
or not handled well in our features and model. However,
since our method for selecting location-sensitive concepts
was data-driven and unsupervised, these classes were included.
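
A ranking like the one in Figure 8 can be produced from per-class AP values as sketched below. The class names are borrowed from the text, but every number in the toy dictionaries is made up for illustration, and the function name is hypothetical.

```python
def rank_ap_differences(baseline_ap, model_ap, n=20):
    """Return the n classes with the largest AP gain and the n with the
    largest AP loss, each as (class, difference) pairs."""
    diffs = {c: model_ap[c] - baseline_ap[c] for c in baseline_ap}
    ordered = sorted(diffs.items(), key=lambda item: item[1], reverse=True)
    best = ordered[:n]
    worst = ordered[-n:][::-1]  # most negative difference first
    return best, worst


# Made-up AP values for four classes.
baseline = {"disneyland": 0.30, "casino": 0.40, "coast": 0.50, "sedan": 0.45}
model = {"disneyland": 0.80, "casino": 0.60, "coast": 0.52, "sedan": 0.40}
best, worst = rank_ap_differences(baseline, model, n=2)
```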

Learned radius parameters. We visualize the radius
parameters learned for several classes in Figure 9. We found
that for most concepts, the 10 different replicas of the radius
learning layers typically converge to 3 or fewer different
radii, like the “coast” and “foliage” classes, which suggests
that certain radii are indeed more informative. Occasionally,
some classes like “building” learn a different radius in almost
every replica, possibly because within urban areas the abundance of
buildings makes smaller radii important, and within rural
areas larger radii become important.
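
One simple way to check how many distinct radii a class has converged to is to group the learned values that agree within a tolerance. The sketch below does this with a relative tolerance. The tolerance, the function name, and all radius values are hypothetical and only mimic the qualitative pattern described above.

```python
def distinct_radii(radii, tol=0.05):
    """Group sorted radii whose neighbors differ by at most a relative
    tolerance, and return the mean radius of each group."""
    groups = []
    for r in sorted(radii):
        if groups and r - groups[-1][-1] <= tol * groups[-1][-1]:
            groups[-1].append(r)
        else:
            groups.append([r])
    return [sum(g) / len(g) for g in groups]


# Made-up radii for 10 replicas of two classes.
coast_like = [1.0, 1.02, 0.99, 5.0, 5.1, 5.05, 1.01, 20.0, 20.3, 19.9]
building_like = [0.5, 1, 2, 3, 5, 8, 12, 20, 30, 50]

coast_groups = distinct_radii(coast_like)
building_groups = distinct_radii(building_like)
```

Here the first class collapses to three radii while the second keeps all ten.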

7. Conclusion 

In this paper, we introduce the problem of image
classification with location context. To represent location context,
we propose 5 features that help capture context about a
particular location, and show how to incorporate them into a
CNN model. For features that require pooling over radii, we
show how to automatically learn the optimal radius within
the same framework, allowing us to obtain better performance
and deeper insight into the network parameters.
Furthermore, we introduce and make publicly available
the YFCC100M-GEO100 dataset, which we manually
annotate to obtain class labels for geotagged images.

For future work, we would like to explore taking advantage
of other aspects of images that are now becoming
widely available, such as time and date taken or the social
relationships between the users.